
Fix vllm gpu arch detection #120

Open

maryamtahhan wants to merge 5 commits into redhat-et:main from maryamtahhan:fix-vllm-gpu-arch-detection

Conversation

@maryamtahhan (Collaborator) commented Apr 15, 2026

Summary by CodeRabbit

  • New Features

    • Added support for a third vLLM cache format (AOT/Mega AOT) with standardized cache layout and AOT artifact metadata exposure.
  • Bug Fixes

    • Improved GPU detection using system properties plus cache metadata, normalized architecture comparisons, clearer warnings on detection failures, and preflight validation for AOT caches.
  • Documentation

    • Expanded vLLM cache docs with three-way format comparisons, end-to-end workflow, practical guide, troubleshooting, and AOT guidance.
  • Chores

    • Minor spelling-ignore list update.

@coderabbitai (Bot) commented Apr 15, 2026

📝 Walkthrough

Adds vLLM AOT compile cache detection and metadata, expands cache format support to include "aot_compile", augments GPU probing to detect actual runtime backend/toolkit/ptx, normalizes architecture comparisons, updates preflight validation for three cache formats, and substantially expands vLLM cache documentation and workflow guidance.

Changes

  • Documentation — mcv/docs/vllm-binary-cache.md: Expanded doc to cover three vLLM cache formats (triton, binary, aot_compile), Torch compile behavior, hardware detection semantics, format comparisons, the end-to-end capture→OCI→extract workflow, a practical guide, troubleshooting, and advanced topics.
  • vLLM cache detection & metadata — mcv/pkg/cache/vllm.go, mcv/pkg/cache/types.go: Detects AOT artifacts under torch_compile_cache/torch_aot_compile/.../model; adds AOTCompileCacheMetadata and AOTCompileEntries to VLLMCacheMetadata; records PTX/CUDA/ROCm/toolkit fields; introduces detectActualGPUInfo() and backend constants (rocm, hip, UnknownBackend).
  • Preflight validation & utils — mcv/pkg/preflightcheck/vllm.go, mcv/pkg/preflightcheck/triton.go, mcv/pkg/preflightcheck/utils.go: Dispatches validation by CacheFormat (adds "aot_compile"), adds lightweight AOT entry checks, tightens binary validation to backend + warp-size, normalizes CUDA arch comparisons (strips sm_), and adds debug logging.
  • GPU device modules — mcv/pkg/accelerator/devices/nvml.go, mcv/pkg/accelerator/devices/amd.go, mcv/pkg/accelerator/devices/rocm.go: When cached Triton GPU info is provided and the device map is nil, rebuilds d.devices from cached TritonGPUInfo entries; SetSummaries updates matching GPUDevice.Summary by parsing summary IDs. Minor NVML comment tweak.
  • Precommit config — gkm-codespell.precommit-toml: Added ERRO to the codespell ignore-words-list.

Sequence Diagram(s)

sequenceDiagram
    participant User as User
    participant Detector as Cache Detection (mcv/pkg/cache/vllm.go)
    participant GPUProbe as GPU Probe (detectActualGPUInfo / nvml / rocm / amd)
    participant Summary as Summary Builder (vllm.go)
    participant Preflight as Preflight Validation (mcv/pkg/preflightcheck)

    User->>Detector: DetectVLLMCache()
    Detector->>Detector: scan torch_compile_cache/ and torch_aot_compile/
    alt aot_compile found
        Detector->>Detector: detectAOTCompileCache() → collect AOT entries grouped by hash
        Detector-->>User: VLLMCacheMetadata{cacheFormat:"aot_compile", aot_entries}
    else binary/triton found
        Detector->>Summary: buildBinaryCacheSummary()
        Summary->>GPUProbe: detectActualGPUInfo()
        GPUProbe-->>Summary: backend, arch, warpSize, ptxVersion, cuda/rocm versions
        Summary-->>Detector: annotated target summaries
        Detector-->>User: VLLMCacheMetadata[] (binary/triton)
    end

    User->>Preflight: CompareVLLMCacheManifestToGPU(manifest)
    Preflight->>Preflight: switch on manifest.CacheFormat
    alt aot_compile
        Preflight->>Preflight: compareAOTCompileCacheEntriesToGPU() (presence/size checks)
    else binary
        Preflight->>Preflight: compareBinaryCacheEntriesToGPU() (backend + warp-size check)
    else triton
        Preflight->>Preflight: CompareTritonEntriesToGPU() (normalized arch compare)
    end
    Preflight-->>User: validation result

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 69.23%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (4 passed)

  • Description Check — ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: The PR title 'Fix vllm gpu arch detection' directly aligns with the core changes: architecture normalization logic is added across multiple files (utils.go, triton.go, vllm.go, preflightcheck files) to fix GPU architecture detection and comparison. This is the primary functional objective of the changeset.
  • Linked Issues Check — ✅ Passed: Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes Check — ✅ Passed: Check skipped because no linked issues were found for this pull request.


@coderabbitai (Bot) left a comment

Actionable comments posted: 4

🧹 Nitpick comments (5)
mcv/pkg/preflightcheck/vllm.go (2)

119-124: Use logging instead of fmt.Printf for consistency.

Same issue as the comment on lines 76-80 below: use logging.Debugf for consistency with the rest of the codebase.

♻️ Suggested fix
 		if backendMatches && warpMatches {
 			matched = true
 			// For detailed arch compatibility, rely on Summary label check
-			fmt.Printf("Binary cache entry matches GPU: backend=%s, warpSize=%d\n",
-				backend, expectedWarpSize)
+			logging.Debugf("Binary cache entry matches GPU: backend=%s, warpSize=%d",
+				backend, expectedWarpSize)
 			break
 		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mcv/pkg/preflightcheck/vllm.go` around lines 119 - 124, The code uses
fmt.Printf in the GPU-matching branch (inside the block checking backendMatches
and warpMatches where matched is set) which breaks logging consistency; replace
the fmt.Printf call that prints "Binary cache entry matches GPU: backend=%s,
warpSize=%d\n" with logging.Debugf using the same formatted message and
variables (backend, expectedWarpSize), keeping the matched=true and break
behavior unchanged and ensuring any fmt import is removed if no longer used.

76-80: Use logging instead of fmt.Printf for consistency.

The rest of the codebase uses the logging package. Using fmt.Printf here is inconsistent and won't respect log level settings.

♻️ Suggested fix
 	// Log the AOT cache entries for debugging
 	for _, entry := range entries {
-		fmt.Printf("AOT compile cache: hash=%s, rank=%s, size=%d bytes\n",
-			entry.Hash, entry.Rank, entry.FileSize)
+		logging.Debugf("AOT compile cache: hash=%s, rank=%s, size=%d bytes",
+			entry.Hash, entry.Rank, entry.FileSize)
 	}

Note: You'll need to add the logging import.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mcv/pkg/preflightcheck/vllm.go` around lines 76 - 80, Replace the fmt.Printf
usage in the AOT cache logging loop with the project's logging package: locate
the loop that iterates over entries (for _, entry := range entries) and change
the print to use logging.<level> (e.g., logging.Debug or logging.Infof depending
on desired verbosity) and include entry.Hash, entry.Rank and entry.FileSize in
the log message; also add the logging import to the file. Ensure the message
format and log level are consistent with other uses of logging in this package.
mcv/pkg/cache/vllm.go (2)

520-523: Move GPU detection outside the inner loop to avoid redundant calls.

detectActualGPUInfo() is called once per VLLMCacheMetadata entry, but it performs expensive operations (config initialization, device registry access, device startup). Since GPU info doesn't change between iterations, move the call outside both loops.

♻️ Suggested fix
 func buildBinaryCacheSummary(metadata []VLLMCacheMetadata) (*Summary, error) {
 	targetMap := make(map[string]SummaryTargetInfo)
+
+	// Detect actual GPU from the system once (not per entry)
+	// NOTE: We detect the actual system GPU rather than trusting VLLM_TARGET_DEVICE
+	// because caches may be copied from other systems
+	detectedBackend, detectedArch, detectedWarpSize, detectedPTX := detectActualGPUInfo()

 	for _, meta := range metadata {
 		if meta.CacheFormat != BinaryCacheFormat {
 			continue
 		}

-		// Detect actual GPU from the system once per metadata entry
-		// NOTE: We detect the actual system GPU rather than trusting VLLM_TARGET_DEVICE
-		// because caches may be copied from other systems
-		detectedBackend, detectedArch, detectedWarpSize, detectedPTX := detectActualGPUInfo()
-
 		for i := range meta.BinaryCacheEntries {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mcv/pkg/cache/vllm.go` around lines 520 - 523, The code currently calls
detectActualGPUInfo() inside the loop that iterates VLLMCacheMetadata, causing
expensive duplicate GPU detection; move the call so it runs once before entering
the outer/inner loops (call detectActualGPUInfo() once and store
detectedBackend, detectedArch, detectedWarpSize, detectedPTX in local variables)
and then use those stored values inside the loop(s) instead of calling
detectActualGPUInfo() repeatedly; update any references in the loop to use the
precomputed variables and remove the inner detectActualGPUInfo() invocation.

350-350: Move regex compilation outside the loop.

The rankDirRegex is recompiled on every iteration of the outer loop. Move it to a package-level variable or outside the loop.

♻️ Suggested fix
+var aotRankDirRegex = regexp.MustCompile(`^rank_\d+_\d+$`)
+
 func detectAOTCompileCache(aotPath string) ([]AOTCompileCacheMetadata, error) {
 	var aotCaches []AOTCompileCacheMetadata
 	// ...
 	for _, hashEntry := range entries {
 		// ...
-		rankDirRegex := regexp.MustCompile(`^rank_\d+_\d+$`)
 		for _, rankEntry := range rankEntries {
 			// ...
-			if !rankDirRegex.MatchString(rankEntry.Name()) {
+			if !aotRankDirRegex.MatchString(rankEntry.Name()) {
 				continue
 			}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mcv/pkg/cache/vllm.go` at line 350, The regex rankDirRegex is being compiled
inside a loop causing repeated allocations; move its compilation out of the loop
(e.g., declare a package-level var or compute once before the loop) and reuse
that compiled *regexp. Ensure you replace the inline
regexp.MustCompile(`^rank_\d+_\d+$`) inside the loop with the shared variable
(referencing rankDirRegex) used by the functions/methods in vllm.go that iterate
directories.
mcv/docs/vllm-binary-cache.md (1)

747-747: Add language specifier to fenced code block.

The static analysis flagged this code block as missing a language specifier.

♻️ Suggested fix
-```
+```text
 ┌─────────────────────────┐
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mcv/docs/vllm-binary-cache.md` at line 747, The fenced code block containing
the ASCII art is missing a language specifier; update the opening fence from ```
to ```text (or another appropriate language) for the ASCII art block in
vllm-binary-cache.md so static analysis recognizes it — locate the ASCII-art
fenced block (the triple-backtick surrounding "┌─────────────────────────┐") and
change the opening fence to include "text".
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@mcv/docs/vllm-binary-cache.md`:
- Around line 331-344: The docs describe the "arch" field as a numeric compute
capability (e.g., "75") but nvml.go now outputs the CUDA SM string with an "sm_"
prefix (e.g., "sm_75"); update the documentation text and all examples to state
that arch contains the SM string (format "sm_<nnn>"), change illustrative
examples from "75" to "sm_75" (and similar for other GPUs), and mention that
this format matches nvml.go's output and NVIDIA convention so readers know to
expect the "sm_" prefix when reading or comparing the arch field.
- Line 691: The docs contain the literal header string "ERRO" which triggers
codespell; if the log format permits, update the header string "ERRO Preflight
check failed: no compatible GPU found" (and the other occurrence around the 707
example) to "ERROR Preflight check failed: no compatible GPU found"; if the
header must remain as-is for fidelity, add "ERRO" to the codespell ignore list
(or adjust the example formatting) so codespell no longer flags the literal
"ERRO" token in mcv/docs/vllm-binary-cache.md.

In `@mcv/pkg/accelerator/devices/nvml.go`:
- Line 225: Normalize GPU architecture strings at comparison sites instead of
changing stored cache values: update CompareTritonCacheToGPU to compare
ConvertArchToString(cacheData.Target.Arch) against a normalized form of gpu.Arch
(strip any "sm_" prefix or non-digit chars), and apply the same normalization in
preflight checks where entry.Arch == gpuInfo.Arch (in
mcv/pkg/preflightcheck/triton.go) and target.Arch == gpu.Arch (in
mcv/pkg/preflightcheck/utils.go); implement a small helper (e.g.,
NormalizeArchString or NormalizeGPUArch) and call it in those comparisons so
legacy cache values like "75" match new detector output "sm_75".

In `@mcv/pkg/cache/vllm.go`:
- Around line 538-557: When backend == "unknown" in the fallback block, also
populate the arch variable from cache metadata so downstream comparisons in
CompareCacheSummaryLabelToGPU don't all fail; retrieve arch from the binary
cache metadata (e.g., binaryCache.Arch) or read environment keys like
VLLM_PAGED_ATTN_ARCH (or any existing cache arch fields) and set arch
accordingly before setting warpSize and backend, falling back to a sensible
default (not "unknown") if none is available.
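The arch-normalization fix suggested above can be sketched as a small shared helper. This is a sketch under assumptions: the helper name (normalizeArch) and exact trimming rules are illustrative, not the project's actual code.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeArch strips whitespace, lowercases, and removes a leading
// "sm_" prefix so legacy cache values like "75" compare equal to the
// newer detector output "sm_75". Hypothetical helper; the review
// suggests a name like NormalizeArchString.
func normalizeArch(arch string) string {
	return strings.TrimPrefix(strings.ToLower(strings.TrimSpace(arch)), "sm_")
}

func main() {
	// Both sides of a comparison are normalized before comparing.
	fmt.Println(normalizeArch("sm_75") == normalizeArch("75")) // prints: true
}
```

Calling the helper on both the cache value and the GPU value at each comparison site (rather than rewriting stored cache metadata) keeps old caches readable.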

---

Nitpick comments:
In `@mcv/docs/vllm-binary-cache.md`:
- Line 747: The fenced code block containing the ASCII art is missing a language
specifier; update the opening fence from ``` to ```text (or another appropriate
language) for the ASCII art block in vllm-binary-cache.md so static analysis
recognizes it — locate the ASCII-art fenced block (the triple-backtick
surrounding "┌─────────────────────────┐") and change the opening fence to
include "text".

In `@mcv/pkg/cache/vllm.go`:
- Around line 520-523: The code currently calls detectActualGPUInfo() inside the
loop that iterates VLLMCacheMetadata, causing expensive duplicate GPU detection;
move the call so it runs once before entering the outer/inner loops (call
detectActualGPUInfo() once and store detectedBackend, detectedArch,
detectedWarpSize, detectedPTX in local variables) and then use those stored
values inside the loop(s) instead of calling detectActualGPUInfo() repeatedly;
update any references in the loop to use the precomputed variables and remove
the inner detectActualGPUInfo() invocation.
- Line 350: The regex rankDirRegex is being compiled inside a loop causing
repeated allocations; move its compilation out of the loop (e.g., declare a
package-level var or compute once before the loop) and reuse that compiled
*regexp. Ensure you replace the inline regexp.MustCompile(`^rank_\d+_\d+$`)
inside the loop with the shared variable (referencing rankDirRegex) used by the
functions/methods in vllm.go that iterate directories.

In `@mcv/pkg/preflightcheck/vllm.go`:
- Around line 119-124: The code uses fmt.Printf in the GPU-matching branch
(inside the block checking backendMatches and warpMatches where matched is set)
which breaks logging consistency; replace the fmt.Printf call that prints
"Binary cache entry matches GPU: backend=%s, warpSize=%d\n" with logging.Debugf
using the same formatted message and variables (backend, expectedWarpSize),
keeping the matched=true and break behavior unchanged and ensuring any fmt
import is removed if no longer used.
- Around line 76-80: Replace the fmt.Printf usage in the AOT cache logging loop
with the project's logging package: locate the loop that iterates over entries
(for _, entry := range entries) and change the print to use logging.<level>
(e.g., logging.Debug or logging.Infof depending on desired verbosity) and
include entry.Hash, entry.Rank and entry.FileSize in the log message; also add
the logging import to the file. Ensure the message format and log level are
consistent with other uses of logging in this package.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 2876a5ef-e373-433f-a07f-fdc9efcc2653

📥 Commits

Reviewing files that changed from the base of the PR and between acd5abc and d34bd9b.

📒 Files selected for processing (5)
  • mcv/docs/vllm-binary-cache.md
  • mcv/pkg/accelerator/devices/nvml.go
  • mcv/pkg/cache/types.go
  • mcv/pkg/cache/vllm.go
  • mcv/pkg/preflightcheck/vllm.go

maryamtahhan force-pushed the fix-vllm-gpu-arch-detection branch 2 times, most recently from 775ea3e to 3b60594 on April 15, 2026 at 10:55
@coderabbitai (Bot) left a comment

Actionable comments posted: 3

🧹 Nitpick comments (3)
mcv/pkg/preflightcheck/vllm.go (2)

76-80: Use logging framework instead of fmt.Printf for consistency.

The codebase uses logging.Infof/logging.Debugf (logrus) for output. Using fmt.Printf bypasses log levels and formatting.

♻️ Proposed fix
 	// Log the AOT cache entries for debugging
 	for _, entry := range entries {
-		fmt.Printf("AOT compile cache: hash=%s, rank=%s, size=%d bytes\n",
+		logging.Debugf("AOT compile cache: hash=%s, rank=%s, size=%d bytes",
 			entry.Hash, entry.Rank, entry.FileSize)
 	}

This requires adding the logging import if not already present.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mcv/pkg/preflightcheck/vllm.go` around lines 76 - 80, Replace the fmt.Printf
calls in the AOT cache logging loop with the project's logging package (use
logging.Infof or logging.Debugf as appropriate) so log levels and formatting are
consistent; update the loop that iterates over entries (the for _, entry :=
range entries block) to call logging.Debugf("AOT compile cache: hash=%s,
rank=%s, size=%d bytes", entry.Hash, entry.Rank, entry.FileSize) and add the
logging import if it's not already present.

119-124: Use logging framework here as well.

Same issue as above—use logging.Debugf for consistency.

♻️ Proposed fix
 		if backendMatches && warpMatches {
 			matched = true
 			// For detailed arch compatibility, rely on Summary label check
-			fmt.Printf("Binary cache entry matches GPU: backend=%s, warpSize=%d\n",
+			logging.Debugf("Binary cache entry matches GPU: backend=%s, warpSize=%d",
 				backend, expectedWarpSize)
 			break
 		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mcv/pkg/preflightcheck/vllm.go` around lines 119 - 124, Replace the
fmt.Printf call in the vllm preflight loop (where backendMatches and warpMatches
are checked and matched is set) with the logging framework: call logging.Debugf
to emit the same message and values (backend, expectedWarpSize), and add or
ensure the logging package is imported; update the print site in the code block
that currently prints "Binary cache entry matches GPU: backend=%s, warpSize=%d"
so it uses logging.Debugf for consistency with other debug output.
mcv/docs/vllm-binary-cache.md (1)

747-747: Add language specifier to fenced code block.

Static analysis flags this block as missing a language specifier. Since it's an ASCII diagram, use text or plaintext.

♻️ Proposed fix
-```
+```text
 ┌─────────────────────────┐
 │  1. Start vLLM          │
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mcv/docs/vllm-binary-cache.md` at line 747, The fenced code block containing
the ASCII diagram (the block starting with triple backticks and lines like
"┌─────────────────────────┐" and "│  1. Start vLLM          │") is missing a
language specifier; update the opening fence from ``` to ```text (or
```plaintext) so the block becomes a text/plaintext fenced code block to satisfy
static analysis.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@mcv/docs/vllm-binary-cache.md`:
- Around line 14-16: Update the docs paragraph in mcv/docs/vllm-binary-cache.md
that claims "Not Currently Supported" for VLLM_USE_AOT_COMPILE=1 to state that
AOT compile is supported; change the wording to describe the supported
"aot_compile" CacheFormat and mention any caveats (e.g., validation rules
enforced by preflight check). Reference the existing implementation: detection
logic in mcv/pkg/cache/vllm.go (which sets CacheFormat: "aot_compile" and
detects torch_aot_compile/), the validation in mcv/pkg/preflightcheck/vllm.go
that recognizes "aot_compile", and the AOTCompileCacheMetadata type used for AOT
artifacts; update the doc to reflect support and include any limitations or
validation requirements.

In `@mcv/pkg/preflightcheck/triton.go`:
- Around line 66-69: entry.Arch is typed as any but normalizeArchForComparison
requires strings; before calling normalizeArchForComparison on entry.Arch,
perform a type assertion to string (e.g., archStr, ok := entry.Arch.(string))
and if the assertion fails convert it to a string safely (e.g.,
fmt.Sprint(entry.Arch) or set to empty) then pass that string to
normalizeArchForComparison; update the two lines using
normalizeArchForComparison(entry.Backend, entry.Arch) to use the
asserted/converted string variable (keep gpuInfo.Backend/gpuInfo.Arch as-is) so
the call signatures match.

In `@mcv/pkg/preflightcheck/vllm.go`:
- Around line 40-54: entry.TritonCacheEntries contains JSON-unmarshalled
map[string]interface{} values so the type assertion to cache.TritonCacheMetadata
will always fail; update the loop that builds convertedEntries (referencing
entry.TritonCacheEntries and cache.TritonCacheMetadata) to convert each element
by re-marshaling the element to JSON and then unmarshaling into a
cache.TritonCacheMetadata (or perform a manual map->struct conversion), handle
and return any marshal/unmarshal errors, and then call
CompareTritonEntriesToGPU(convertedEntries, devInfo) as before.
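Both triton.go and vllm.go findings above come down to recovering typed values from JSON-decoded interface{} data. A minimal sketch under assumed names (the real cache.TritonCacheMetadata schema and the helper names are illustrative):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// TritonCacheMetadata stands in for cache.TritonCacheMetadata; the
// fields here are illustrative, not the project's actual schema.
type TritonCacheMetadata struct {
	Hash string `json:"hash"`
	Arch string `json:"arch"`
}

// archToString safely turns an `any`-typed Arch value into a string,
// falling back to fmt.Sprint when the type assertion fails.
func archToString(v any) string {
	if s, ok := v.(string); ok {
		return s
	}
	if v == nil {
		return ""
	}
	return fmt.Sprint(v)
}

// convertEntries round-trips JSON-decoded map values into typed
// structs; a direct assertion like e.(TritonCacheMetadata) always
// fails on the map[string]interface{} values json.Unmarshal produces.
func convertEntries(raw []any) ([]TritonCacheMetadata, error) {
	out := make([]TritonCacheMetadata, 0, len(raw))
	for _, e := range raw {
		b, err := json.Marshal(e)
		if err != nil {
			return nil, err
		}
		var m TritonCacheMetadata
		if err := json.Unmarshal(b, &m); err != nil {
			return nil, err
		}
		out = append(out, m)
	}
	return out, nil
}

func main() {
	raw := []any{map[string]any{"hash": "abc123", "arch": "sm_75"}}
	entries, err := convertEntries(raw)
	fmt.Println(entries, err, archToString(75))
	// prints: [{abc123 sm_75}] <nil> 75
}
```

The marshal/unmarshal round trip is slower than a manual map-to-struct copy but stays correct if the struct gains fields later.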

---

Nitpick comments:
In `@mcv/docs/vllm-binary-cache.md`:
- Line 747: The fenced code block containing the ASCII diagram (the block
starting with triple backticks and lines like "┌─────────────────────────┐" and
"│  1. Start vLLM          │") is missing a language specifier; update the
opening fence from ``` to ```text (or ```plaintext) so the block becomes a
text/plaintext fenced code block to satisfy static analysis.

In `@mcv/pkg/preflightcheck/vllm.go`:
- Around line 76-80: Replace the fmt.Printf calls in the AOT cache logging loop
with the project's logging package (use logging.Infof or logging.Debugf as
appropriate) so log levels and formatting are consistent; update the loop that
iterates over entries (the for _, entry := range entries block) to call
logging.Debugf("AOT compile cache: hash=%s, rank=%s, size=%d bytes", entry.Hash,
entry.Rank, entry.FileSize) and add the logging import if it's not already
present.
- Around line 119-124: Replace the fmt.Printf call in the vllm preflight loop
(where backendMatches and warpMatches are checked and matched is set) with the
logging framework: call logging.Debugf to emit the same message and values
(backend, expectedWarpSize), and add or ensure the logging package is imported;
update the print site in the code block that currently prints "Binary cache
entry matches GPU: backend=%s, warpSize=%d" so it uses logging.Debugf for
consistency with other debug output.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: d50b4c4e-5004-48ec-b8f6-b315fa99ed93

📥 Commits

Reviewing files that changed from the base of the PR and between d34bd9b and 3b60594.

📒 Files selected for processing (8)
  • gkm-codespell.precommit-toml
  • mcv/docs/vllm-binary-cache.md
  • mcv/pkg/accelerator/devices/nvml.go
  • mcv/pkg/cache/types.go
  • mcv/pkg/cache/vllm.go
  • mcv/pkg/preflightcheck/triton.go
  • mcv/pkg/preflightcheck/utils.go
  • mcv/pkg/preflightcheck/vllm.go
✅ Files skipped from review due to trivial changes (3)
  • gkm-codespell.precommit-toml
  • mcv/pkg/accelerator/devices/nvml.go
  • mcv/pkg/cache/vllm.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • mcv/pkg/cache/types.go

maryamtahhan force-pushed the fix-vllm-gpu-arch-detection branch 6 times, most recently from eb76dd9 to eec4df9 on April 15, 2026 at 11:52
maryamtahhan requested a review from cmagina on April 15, 2026 at 12:02
@coderabbitai (Bot) left a comment

Actionable comments posted: 3

♻️ Duplicate comments (1)
mcv/pkg/cache/vllm.go (1)

540-542: ⚠️ Potential issue | 🟡 Minor

Fix fallback path to update arch when GPU detection fails.

Lines 552-554 update the backend when detection fails, but leave arch as the string "UnknownBackend" from failed detection. This causes malformed arch strings like "sm_UnknownBackend" (line 541) when the fallback backend resolves to CUDA.

The arch should either be:

  1. Reset to a sensible default (e.g., "75" for CUDA), or
  2. Extracted from cache metadata alongside the backend
Example fix: update arch in fallback path
 			// Handle special cases where no GPU is detected
 			if backend == UnknownBackend {
 				logging.Warn("Could not detect GPU on system, using cache metadata as fallback")
 				// Fallback to cache metadata if GPU detection failed
 				backend = binaryCache.TargetDevice
 				if backend == "" {
 					backend = CUDABackend // Default if not specified
 				}
+				// Reset arch to default since detection failed
+				arch = "75" // Safe default; consider extracting from cache if available
 				// Set default warp sizes
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mcv/pkg/cache/vllm.go` around lines 540 - 542, The fallback path that resets
backend leaves arch set to the failed-detection sentinel (producing values like
"sm_UnknownBackend"); update the fallback logic so that when backend is
reassigned to CUDABackend you also update arch to a sensible CUDA default (e.g.,
"75") or load the arch from the cache metadata at the same time; modify the code
handling variables backend and arch (the block that currently checks backend ==
CUDABackend and formats arch with "sm_%s") so arch is corrected when the
fallback backend is applied.
🧹 Nitpick comments (1)
mcv/pkg/cache/vllm.go (1)

524-527: GPU detection called redundantly inside loop.

detectActualGPUInfo() is called inside the metadata loop (line 527), meaning it's invoked once per VLLMCacheMetadata entry. Since the system's GPU doesn't change during execution, this detection should be performed once before the loop.

♻️ Suggested refactor to detect GPU once
 func buildBinaryCacheSummary(metadata []VLLMCacheMetadata) (*Summary, error) {
 	targetMap := make(map[string]SummaryTargetInfo)

+	// Detect actual GPU from the system once
+	// NOTE: We detect the actual system GPU rather than trusting VLLM_TARGET_DEVICE
+	// because caches may be copied from other systems
+	detectedBackend, detectedArch, detectedWarpSize, detectedPTX := detectActualGPUInfo()
+
 	for _, meta := range metadata {
 		if meta.CacheFormat != BinaryCacheFormat {
 			continue
 		}

-		// Detect actual GPU from the system once per metadata entry
-		// NOTE: We detect the actual system GPU rather than trusting VLLM_TARGET_DEVICE
-		// because caches may be copied from other systems
-		detectedBackend, detectedArch, detectedWarpSize, detectedPTX := detectActualGPUInfo()
-
 		for i := range meta.BinaryCacheEntries {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mcv/pkg/cache/vllm.go` around lines 524 - 527, detectActualGPUInfo() is being
called inside the per-entry metadata loop (where VLLMCacheMetadata entries are
processed), causing redundant system GPU detection; move the call out of the
loop so detectedBackend, detectedArch, detectedWarpSize, detectedPTX are
computed once before iterating over the metadata slice/loop and reuse those
values for each VLLMCacheMetadata entry. Update any references inside the loop
to use the precomputed variables and remove the in-loop call to
detectActualGPUInfo().
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@mcv/docs/vllm-binary-cache.md`:
- Around line 14-20: The documentation contains contradictory statements about
AOT compile support; update all passages that say `VLLM_USE_AOT_COMPILE=1` is
"not currently supported" to reflect that AOT compile is supported by MCV now
(since the PR adds detectAOTCompileCache and AOT validation in preflight
checks). Search for occurrences mentioning AOT support and consistently state
that MCV supports the AOT workflow, describe the created cache path and that
compatibility is validated using the summary label during preflight, and mention
the new detection/validation functionality by name (`detectAOTCompileCache` and
the AOT validation in preflight checks) so readers know the code path that
enforces support.
- Around line 335-348: Update the documentation example to match the actual code
output: change the example JSON `"arch": "75"` to `"arch": "sm_75"` and add a
short note that CUDA architectures are formatted with an "sm_" prefix (this
formatting comes from the arch = fmt.Sprintf("sm_%s", arch) logic in
mcv/pkg/cache/vllm.go).

In `@mcv/pkg/cache/vllm.go`:
- Around line 548-567: When backend == UnknownBackend after
detectActualGPUInfo(), also populate arch from cache metadata or env fallback
instead of leaving it as "UnknownBackend": read arch from
binaryCache.Environment["VLLM_PAGED_ATTN_ARCH"] (or
os.Getenv("VLLM_PAGED_ATTN_ARCH")) and assign it to arch if non-empty; if that
is empty, derive a sensible default from binaryCache.TargetDevice or backend and
normalize it to match the values used by CompareCacheSummaryLabelToGPU so
architecture comparisons succeed; ensure this logic sits in the same block that
sets backend and warpSize so Summary label uses the corrected arch.

---

Duplicate comments:
In `@mcv/pkg/cache/vllm.go`:
- Around line 540-542: The fallback path that resets backend leaves arch set to
the failed-detection sentinel (producing values like "sm_UnknownBackend");
update the fallback logic so that when backend is reassigned to CUDABackend you
also update arch to a sensible CUDA default (e.g., "75") or load the arch from
the cache metadata at the same time; modify the code handling variables backend
and arch (the block that currently checks backend == CUDABackend and formats
arch with "sm_%s") so arch is corrected when the fallback backend is applied.

---

Nitpick comments:
In `@mcv/pkg/cache/vllm.go`:
- Around line 524-527: detectActualGPUInfo() is being called inside the
per-entry metadata loop (where VLLMCacheMetadata entries are processed), causing
redundant system GPU detection; move the call out of the loop so
detectedBackend, detectedArch, detectedWarpSize, detectedPTX are computed once
before iterating over the metadata slice/loop and reuse those values for each
VLLMCacheMetadata entry. Update any references inside the loop to use the
precomputed variables and remove the in-loop call to detectActualGPUInfo().
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: d7159a6d-5287-4a2d-b977-f2c566e1aaff

📥 Commits

Reviewing files that changed from the base of the PR and between 3b60594 and eec4df9.

📒 Files selected for processing (10)
  • gkm-codespell.precommit-toml
  • mcv/docs/vllm-binary-cache.md
  • mcv/pkg/accelerator/devices/amd.go
  • mcv/pkg/accelerator/devices/nvml.go
  • mcv/pkg/accelerator/devices/rocm.go
  • mcv/pkg/cache/types.go
  • mcv/pkg/cache/vllm.go
  • mcv/pkg/preflightcheck/triton.go
  • mcv/pkg/preflightcheck/utils.go
  • mcv/pkg/preflightcheck/vllm.go
✅ Files skipped from review due to trivial changes (1)
  • gkm-codespell.precommit-toml
🚧 Files skipped from review as they are similar to previous changes (2)
  • mcv/pkg/preflightcheck/triton.go
  • mcv/pkg/accelerator/devices/nvml.go

Comment on lines +14 to +20
**AOT Compile Support**: MCV **supports** the `VLLM_USE_AOT_COMPILE=1` workflow,
which creates a separate cache structure at
`torch_compile_cache/torch_aot_compile/{hash}/rank_{rank}_{dp_rank}/model`.
AOT compile caches store ahead-of-time compiled models as single binary files
rather than multiple compilation artifacts. During preflight checks, AOT cache
compatibility is validated primarily via the summary label, as the cache metadata
contains limited hardware information.
⚠️ Potential issue | 🟡 Minor

Documentation contains contradictory statements about AOT compile support.

Lines 14-20 state that MCV "supports the VLLM_USE_AOT_COMPILE=1 workflow", but lines 35-37 state the same workflow is "not currently supported by MCV". Similar contradictions appear at lines 111-112 and 511-513.

Please resolve this inconsistency. Based on the code changes in this PR (which add detectAOTCompileCache and AOT validation in preflight checks), AOT compile appears to be supported. Update all sections to reflect the actual support status consistently.

Also applies to: 35-37

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mcv/docs/vllm-binary-cache.md` around lines 14 - 20, The documentation
contains contradictory statements about AOT compile support; update all passages
that say `VLLM_USE_AOT_COMPILE=1` is "not currently supported" to reflect that
AOT compile is supported by MCV now (since the PR adds detectAOTCompileCache and
AOT validation in preflight checks). Search for occurrences mentioning AOT
support and consistently state that MCV supports the AOT workflow, describe the
created cache path and that compatibility is validated using the summary label
during preflight, and mention the new detection/validation functionality by name
(`detectAOTCompileCache` and the AOT validation in preflight checks) so readers
know the code path that enforces support.

Comment on lines +548 to +567
// Handle special cases where no GPU is detected
if backend == UnknownBackend {
	logging.Warn("Could not detect GPU on system, using cache metadata as fallback")
	// Fallback to cache metadata if GPU detection failed
	backend = binaryCache.TargetDevice
	if backend == "" {
		backend = CUDABackend // Default if not specified
	}
	// Set default warp sizes
	switch backend {
	case ROCmBackend, HIPBackend:
		warpSize = 64
	case CUDABackend:
		warpSize = 32
	case "tpu":
		warpSize = 128
	case "cpu":
		warpSize = 1
	}
}
⚠️ Potential issue | 🟠 Major

Fallback path still leaves arch potentially invalid.

When GPU detection fails (backend == UnknownBackend), the code falls back to cache metadata for backend and warpSize, but arch keeps the value returned from detectActualGPUInfo(), which is "UnknownBackend" on failure (lines 452, 460, 467, 474, 481).

This means the Summary label will contain arch: "UnknownBackend", causing all architecture comparisons in CompareCacheSummaryLabelToGPU to fail. Consider extracting arch from cache environment (e.g., from VLLM_PAGED_ATTN_ARCH or similar env var if available) as a secondary fallback, or document this limitation clearly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mcv/pkg/cache/vllm.go` around lines 548 - 567, When backend == UnknownBackend
after detectActualGPUInfo(), also populate arch from cache metadata or env
fallback instead of leaving it as "UnknownBackend": read arch from
binaryCache.Environment["VLLM_PAGED_ATTN_ARCH"] (or
os.Getenv("VLLM_PAGED_ATTN_ARCH")) and assign it to arch if non-empty; if that
is empty, derive a sensible default from binaryCache.TargetDevice or backend and
normalize it to match the values used by CompareCacheSummaryLabelToGPU so
architecture comparisons succeed; ensure this logic sits in the same block that
sets backend and warpSize so Summary label uses the corrected arch.
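The fallback this prompt describes can be sketched as a small helper. `resolveArch` is a hypothetical name, and reading the value from a `VLLM_PAGED_ATTN_ARCH`-style setting is the assumption the prompt makes; only the `sm_` prefixing matches the formatting MCV's code is described as using:

```go
package main

import (
	"fmt"
	"strings"
)

const (
	UnknownBackend = "UnknownBackend"
	CUDABackend    = "cuda"
)

// resolveArch is an illustrative helper, not MCV's actual function: when
// detection failed, fall back to an env-derived value from the cache
// metadata, then normalize CUDA architectures to the sm_ summary form.
func resolveArch(detectedArch, envArch, backend string) string {
	arch := detectedArch
	if arch == UnknownBackend || arch == "" {
		arch = envArch // e.g. taken from VLLM_PAGED_ATTN_ARCH
	}
	if backend == CUDABackend && arch != "" && !strings.HasPrefix(arch, "sm_") {
		arch = fmt.Sprintf("sm_%s", arch)
	}
	return arch
}

func main() {
	fmt.Println(resolveArch(UnknownBackend, "75", CUDABackend)) // sm_75
	fmt.Println(resolveArch("sm_75", "", CUDABackend))          // sm_75
}
```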

maryamtahhan and others added 5 commits April 22, 2026 10:40
This commit fixes a critical bug in vLLM cache image creation where the
GPU compute capability was incorrectly derived from the CUDA toolkit
version instead of detecting the actual GPU hardware.

## Bug Fixed

Previously, mcv would read `VLLM_MAIN_CUDA_VERSION` (e.g., "12.9") from
the cache metadata and use it as the compute capability (e.g., "sm_12.9").
This is incorrect because:
- CUDA version != compute capability
- sm_12.9 doesn't exist (valid examples: sm_75, sm_80, sm_89)
- Preflight checks would always fail due to architecture mismatch

## Changes

1. **GPU Detection**:
   - Added `detectActualGPUInfo()` to detect real GPU hardware on the system
   - Detects backend (CUDA/ROCm), architecture, warp size, and PTX version
   - Works for both NVIDIA (CUDA) and AMD (ROCm) GPUs
   - Handles cases where cache is copied from another system

2. **Summary Metadata**:
   - Added `PTXVersion` and `CUDAVersion` fields to `SummaryTargetInfo`
   - Captures both actual GPU arch (e.g., sm_75) and toolkit version (e.g., "12.9")
   - Enables accurate compatibility checking

3. **Robust Detection**:
   - Detects actual GPUs first, doesn't trust `VLLM_TARGET_DEVICE` from copied caches
   - Falls back to cache metadata only if GPU detection fails
   - Logs detailed information for debugging

## Documentation Updates

- Fixed Hardware Detection section to show correct architecture detection
- Added complete end-to-end workflow example:
  * Starting vLLM container
  * Waiting for cache generation (happens at startup, not first inference)
  * Capturing cache from container
  * Building OCI image with mcv
  * Pushing to registry
  * Extracting with preflight checks
  * Verifying GPU compatibility
- Clarified that cache compilation happens during vLLM startup, not on first request

## Example

Before (incorrect):
```json
{"backend": "cuda", "arch": "sm_12.9", "warp_size": 32}
```

After (correct):
```json
{
  "backend": "cuda",
  "arch": "sm_75",
  "warp_size": 32,
  "ptx_version": 75,
  "cuda_version": "12.9"
}
```

Now preflight checks correctly match Tesla T4 (sm_75) caches with Tesla T4 GPUs.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
This commit includes two related improvements to vLLM cache handling:

1. **NVIDIA Architecture Format Fix**
   - Changed Arch field from numeric (e.g., '75') to proper format ('sm_75')
   - Matches vLLM/PyTorch conventions for GPU architecture strings
   - Ensures correct preflight check validation

2. **AOT Compile Cache Support**
   - Added detection for VLLM_USE_AOT_COMPILE=1 cache format
   - Implements AOTCompileCacheMetadata type
   - Supports torch_aot_compile/{hash}/rank_X_Y/model structure
   - Extends VLLMCacheMetadata to handle 'aot_compile' format
   - Adds preflight validation for AOT caches

Changes enable proper detection and validation of:
- Binary caches (existing)
- Triton caches (existing)
- AOT compile caches (new)

Tested on Tesla T4 (sm_75) - preflight checks pass correctly.

Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Fixes 7 linting issues: adds constants for backend types (ROCm, HIP, Unknown), simplifies a function signature, converts an if-else chain to a switch statement, and marks an unused parameter.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Replaces CUDA version-based architecture detection with actual GPU
hardware detection via NVML/ROCm libraries. This ensures cache images
are built with the correct architecture for the actual GPU present,
rather than inferring from toolkit versions.

Key changes:
- NVML returns numeric arch format (e.g., "75") for Triton compatibility
- vLLM binary cache adds sm_ prefix for CUDA (e.g., "sm_75") in summary
- Preflight checks normalize arch strings to handle both "75" and "sm_75"
- Added arch normalization function for cross-cache-format compatibility

This approach maintains compatibility with:
- Triton CUDA caches expecting "75"
- vLLM binary caches expecting "sm_75"
- AMD/ROCm caches using "gfx1151"

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
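The cross-format normalization this commit describes can be sketched as below. `normalizeArch` is an illustrative name (the commit only says a normalization function was added to the preflight checks); the behavior shown, stripping the CUDA `sm_` prefix so `"75"` and `"sm_75"` compare equal while leaving AMD identifiers like `"gfx1151"` untouched, follows the commit's own examples:

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeArch strips the CUDA sm_ prefix so architecture strings from
// different cache formats ("75" from Triton, "sm_75" from vLLM binary
// summaries) compare equal; AMD/ROCm names (e.g. "gfx1151") pass through.
func normalizeArch(arch string) string {
	return strings.TrimPrefix(arch, "sm_")
}

func main() {
	fmt.Println(normalizeArch("sm_75") == normalizeArch("75")) // true
	fmt.Println(normalizeArch("gfx1151"))                      // gfx1151
}
```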
- Update vllm-binary-cache.md to reflect AOT compile support
- Add text language specifier to ASCII diagram code fence
- Fix TritonCacheEntries type conversion via JSON re-marshalling
- Replace fmt.Printf with logging.Debugf for consistency

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
@maryamtahhan maryamtahhan force-pushed the fix-vllm-gpu-arch-detection branch from eec4df9 to 1702619 Compare April 22, 2026 09:44
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
mcv/pkg/cache/vllm.go (1)

106-116: ⚠️ Potential issue | 🟠 Major

Avoid detecting the same torch_aot_compile artifacts as both binary and aot_compile.

torch_aot_compile/{hash}/rank_X_Y/model is first appended as BinaryCacheFormat via detectMegaAOTEntries, then appended again as "aot_compile" via detectAOTCompileCache. This duplicates manifest entries/counts and can break preflight on ROCm/HIP because the synthetic binary metadata has no TargetDevice, so validation defaults it to CUDA. Keep one representation for this layout—prefer the explicit "aot_compile" path added in this PR.

🐛 Proposed fix direction
-				// Mega-AOT layout wraps per-model hash dirs under
-				// torch_aot_compile/. Recurse one level and treat each
-				// child as a hash dir.
 				if entry.Name() == torchAOTCompileDirName {
-					aotMeta, aotCount := detectMegaAOTEntries(
-						filepath.Join(torchCompileCachePath, entry.Name()),
-					)
-					metadata = append(metadata, aotMeta...)
-					count += aotCount
+					// Handled below as explicit "aot_compile" metadata.
 					continue
 				}

Also applies to: 175-202

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mcv/pkg/cache/vllm.go` around lines 106 - 116, The current loop treats the
torch_aot_compile dir as Mega-AOT by calling detectMegaAOTEntries (leading to
BinaryCacheFormat entries) and later also detects it as "aot_compile" via
detectAOTCompileCache, causing duplicate manifest entries; change the logic so
that when entry.Name() == torchAOTCompileDirName you do NOT call
detectMegaAOTEntries and instead skip/continue so only detectAOTCompileCache
produces entries. Update the conditional in the loop (the block referencing
torchAOTCompileDirName, detectMegaAOTEntries, metadata and count) to avoid
appending aot artifacts as BinaryCacheFormat and rely exclusively on
detectAOTCompileCache for that layout (keep torchAOTCompileDirName,
detectAOTCompileCache, metadata, and count references to locate the code).
♻️ Duplicate comments (3)
mcv/docs/vllm-binary-cache.md (2)

14-20: ⚠️ Potential issue | 🟠 Major

Make the AOT compile support status and cache format consistent.

The doc says MCV supports VLLM_USE_AOT_COMPILE=1 at Lines 14-20, but later says it is unsupported at Lines 35-46, 107-112, and 511-513. The format table also maps AOT to "binary", while the implementation now emits/validates cacheFormat: "aot_compile". Please update these sections to consistently describe VLLM_USE_AOT_COMPILE=1 as supported with the explicit "aot_compile" manifest format, and keep “Mega AOT” separate from AOT compile.

📝 Proposed wording direction
-**Note**: The `VLLM_USE_AOT_COMPILE=1` workflow uses a different structure at
-`torch_compile_cache/torch_aot_compile/{hash}/rank_{rank}_{dp_rank}/model` and is
-not currently supported by MCV.
+**Note**: The `VLLM_USE_AOT_COMPILE=1` workflow uses a different structure at
+`torch_compile_cache/torch_aot_compile/{hash}/rank_{rank}_{dp_rank}/model`.
+MCV supports this layout as `cacheFormat: "aot_compile"`; compatibility is
+validated primarily through the image summary label because the AOT metadata
+contains limited hardware details.
-| AOT | Binary files | `"binary"` | `"binary"` | `"binary"` |
+| AOT compile | Single `model` file | `"aot_compile"` | N/A | `"binary"` |

Also applies to: 35-46, 107-112, 477-483, 511-513

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mcv/docs/vllm-binary-cache.md` around lines 14 - 20, Update the documentation
so AOT compile (VLLM_USE_AOT_COMPILE=1) is consistently described as supported:
change the "AOT Compile Support" paragraph to state it is supported, replace any
claims that it is unsupported with support wording, and ensure the
manifest/format table and any references use cacheFormat: "aot_compile" instead
of "binary"; keep the separate concept/name "Mega AOT" distinct from AOT compile
in all sections (including the format mapping and compatibility notes) and
ensure the preflight/cache validation text references the summary label and
limited hardware metadata for the "aot_compile" format.

332-348: ⚠️ Potential issue | 🟡 Minor

Update CUDA summary examples to use the sm_ architecture format.

The docs show summary labels with "arch": "75", but buildBinaryCacheSummary prefixes CUDA architecture as sm_, so the emitted summary is expected to look like "arch": "sm_75". Please align the examples and the explanatory text with the actual label format.

📝 Proposed documentation fix
 {
   "backend": "cuda",
-  "arch": "75",
+  "arch": "sm_75",
   "warp_size": 32,
   "ptx_version": 590,
   "cuda_version": "12.9"
 }
-**Important**: Notice that the `arch` field shows the **actual GPU compute capability** (e.g., `75` for Tesla T4 which is sm_7.5), not the CUDA toolkit version.
+**Important**: Notice that the `arch` field shows the **actual GPU compute capability** in CUDA `sm_` format (e.g., `sm_75` for Tesla T4 / sm_7.5), not the CUDA toolkit version.

Also applies to: 642-651

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mcv/docs/vllm-binary-cache.md` around lines 332 - 348, Update the CUDA
architecture examples and explanatory text to use the "sm_" prefix that
buildBinaryCacheSummary produces (e.g., "arch": "sm_75" instead of "75"); change
the JSON example and all descriptive mentions (including the sentence about
compute capability and any other examples around the same section) to the sm_
format, and ensure references to environment variables remain correct
(VLLM_TARGET_DEVICE for backend, VLLM_MAIN_CUDA_VERSION for CUDA version) so
labels emitted by buildBinaryCacheSummary match the documentation.
mcv/pkg/cache/vllm.go (1)

636-655: ⚠️ Potential issue | 🟠 Major

Populate arch when falling back to cache metadata.

When GPU detection fails, backend and warpSize are replaced from cache metadata/defaults, but arch remains UnknownBackend. That produces a summary target like backend=cuda, arch=UnknownBackend, which will fail architecture compatibility checks later. Add an environment-derived architecture fallback or omit the target when architecture cannot be determined.

🐛 Proposed fix direction
 			if backend == UnknownBackend {
 				logging.Warn("Could not detect GPU on system, using cache metadata as fallback")
 				// Fallback to cache metadata if GPU detection failed
 				backend = binaryCache.TargetDevice
 				if backend == "" {
 					backend = CUDABackend // Default if not specified
 				}
+				if envArch, ok := binaryCache.Env["VLLM_PAGED_ATTN_ARCH"].(string); ok && envArch != "" {
+					arch = envArch
+				}
+				if backend == CUDABackend && arch != "" && !strings.HasPrefix(arch, "sm_") {
+					arch = fmt.Sprintf("sm_%s", arch)
+				}
 				// Set default warp sizes
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mcv/pkg/cache/vllm.go` around lines 636 - 655, When falling back because
backend == UnknownBackend, the code updates backend and warpSize but leaves arch
as UnknownBackend which breaks later compatibility checks; update the fallback
block in vllm.go (the section handling UnknownBackend) to also populate arch
from cache or environment (e.g., use binaryCache.TargetArch or an env var) or
clear/omit arch when it cannot be determined, and ensure any defaulting logic
(mapping backend->arch) is applied so the resulting target does not contain
arch==UnknownBackend; change references around backend, arch, warpSize,
UnknownBackend, binaryCache.TargetDevice and CUDABackend accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@mcv/pkg/accelerator/devices/nvml.go`:
- Around line 69-79: SetTritonInfo currently replaces d.tritonInfo but only
merges entries into the existing d.devices map, leaving stale GPUs visible to
GetGPUInfo/GetAllGPUInfo; fix by reinitializing d.devices to an empty map before
the loop in SetTritonInfo (e.g., set d.devices = make(map[int]GPUDevice)) and
then populate it from the provided info slice so the map exactly reflects the
new tritonInfo entries (retain the existing per-device field assignments such as
TritonInfo and leave Summary to be set by SetSummaries).

---

Outside diff comments:
In `@mcv/pkg/cache/vllm.go`:
- Around line 106-116: The current loop treats the torch_aot_compile dir as
Mega-AOT by calling detectMegaAOTEntries (leading to BinaryCacheFormat entries)
and later also detects it as "aot_compile" via detectAOTCompileCache, causing
duplicate manifest entries; change the logic so that when entry.Name() ==
torchAOTCompileDirName you do NOT call detectMegaAOTEntries and instead
skip/continue so only detectAOTCompileCache produces entries. Update the
conditional in the loop (the block referencing torchAOTCompileDirName,
detectMegaAOTEntries, metadata and count) to avoid appending aot artifacts as
BinaryCacheFormat and rely exclusively on detectAOTCompileCache for that layout
(keep torchAOTCompileDirName, detectAOTCompileCache, metadata, and count
references to locate the code).

---

Duplicate comments:
In `@mcv/docs/vllm-binary-cache.md`:
- Around line 14-20: Update the documentation so AOT compile
(VLLM_USE_AOT_COMPILE=1) is consistently described as supported: change the "AOT
Compile Support" paragraph to state it is supported, replace any claims that it
is unsupported with support wording, and ensure the manifest/format table and
any references use cacheFormat: "aot_compile" instead of "binary"; keep the
separate concept/name "Mega AOT" distinct from AOT compile in all sections
(including the format mapping and compatibility notes) and ensure the
preflight/cache validation text references the summary label and limited
hardware metadata for the "aot_compile" format.
- Around line 332-348: Update the CUDA architecture examples and explanatory
text to use the "sm_" prefix that buildBinaryCacheSummary produces (e.g.,
"arch": "sm_75" instead of "75"); change the JSON example and all descriptive
mentions (including the sentence about compute capability and any other examples
around the same section) to the sm_ format, and ensure references to environment
variables remain correct (VLLM_TARGET_DEVICE for backend, VLLM_MAIN_CUDA_VERSION
for CUDA version) so labels emitted by buildBinaryCacheSummary match the
documentation.

In `@mcv/pkg/cache/vllm.go`:
- Around line 636-655: When falling back because backend == UnknownBackend, the
code updates backend and warpSize but leaves arch as UnknownBackend which breaks
later compatibility checks; update the fallback block in vllm.go (the section
handling UnknownBackend) to also populate arch from cache or environment (e.g.,
use binaryCache.TargetArch or an env var) or clear/omit arch when it cannot be
determined, and ensure any defaulting logic (mapping backend->arch) is applied
so the resulting target does not contain arch==UnknownBackend; change references
around backend, arch, warpSize, UnknownBackend, binaryCache.TargetDevice and
CUDABackend accordingly.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 6b53339b-63dd-4b5b-99a8-f180789f5388

📥 Commits

Reviewing files that changed from the base of the PR and between eec4df9 and 1702619.

📒 Files selected for processing (10)
  • gkm-codespell.precommit-toml
  • mcv/docs/vllm-binary-cache.md
  • mcv/pkg/accelerator/devices/amd.go
  • mcv/pkg/accelerator/devices/nvml.go
  • mcv/pkg/accelerator/devices/rocm.go
  • mcv/pkg/cache/types.go
  • mcv/pkg/cache/vllm.go
  • mcv/pkg/preflightcheck/triton.go
  • mcv/pkg/preflightcheck/utils.go
  • mcv/pkg/preflightcheck/vllm.go
✅ Files skipped from review due to trivial changes (1)
  • gkm-codespell.precommit-toml
🚧 Files skipped from review as they are similar to previous changes (5)
  • mcv/pkg/accelerator/devices/amd.go
  • mcv/pkg/accelerator/devices/rocm.go
  • mcv/pkg/preflightcheck/triton.go
  • mcv/pkg/preflightcheck/utils.go
  • mcv/pkg/cache/types.go

Comment on lines +69 to +79
// Rebuild devices map from cached triton info
if d.devices == nil {
	d.devices = make(map[int]GPUDevice)
}
for _, tritonInfo := range info {
	d.devices[tritonInfo.ID] = GPUDevice{
		ID:         tritonInfo.ID,
		TritonInfo: tritonInfo,
		// Summary will be set by SetSummaries
	}
}
⚠️ Potential issue | 🟠 Major

Rebuild devices instead of merging into the existing map.

SetTritonInfo replaces d.tritonInfo, but only merges into d.devices. If this setter runs after a previous probe/cache load, stale GPU entries remain visible through GetGPUInfo/GetAllGPUInfo.

Suggested fix
 	// Rebuild devices map from cached triton info
-	if d.devices == nil {
-		d.devices = make(map[int]GPUDevice)
-	}
+	d.devices = make(map[int]GPUDevice, len(info))
 	for _, tritonInfo := range info {
 		d.devices[tritonInfo.ID] = GPUDevice{
 			ID:         tritonInfo.ID,
 			TritonInfo: tritonInfo,
 			// Summary will be set by SetSummaries
 		}
 	}
+	if len(d.summaries) > 0 {
+		d.SetSummaries(d.summaries)
+	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
Before:

// Rebuild devices map from cached triton info
if d.devices == nil {
	d.devices = make(map[int]GPUDevice)
}
for _, tritonInfo := range info {
	d.devices[tritonInfo.ID] = GPUDevice{
		ID:         tritonInfo.ID,
		TritonInfo: tritonInfo,
		// Summary will be set by SetSummaries
	}
}

After:

// Rebuild devices map from cached triton info
d.devices = make(map[int]GPUDevice, len(info))
for _, tritonInfo := range info {
	d.devices[tritonInfo.ID] = GPUDevice{
		ID:         tritonInfo.ID,
		TritonInfo: tritonInfo,
		// Summary will be set by SetSummaries
	}
}
if len(d.summaries) > 0 {
	d.SetSummaries(d.summaries)
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mcv/pkg/accelerator/devices/nvml.go` around lines 69 - 79, SetTritonInfo
currently replaces d.tritonInfo but only merges entries into the existing
d.devices map, leaving stale GPUs visible to GetGPUInfo/GetAllGPUInfo; fix by
reinitializing d.devices to an empty map before the loop in SetTritonInfo (e.g.,
set d.devices = make(map[int]GPUDevice)) and then populate it from the provided
info slice so the map exactly reflects the new tritonInfo entries (retain the
existing per-device field assignments such as TritonInfo and leave Summary to be
set by SetSummaries).
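The rebuild-instead-of-merge fix can be demonstrated in isolation. The types below are minimal stand-ins for the ones in `nvml.go` (fields and names are assumptions beyond what the review quotes); the point is that reallocating the map drops entries from an earlier probe:

```go
package main

import "fmt"

// Minimal stand-ins for the types referenced in the review comment.
type TritonGPUInfo struct{ ID int }
type GPUDevice struct {
	ID         int
	TritonInfo TritonGPUInfo
}

type dev struct{ devices map[int]GPUDevice }

// setTritonInfo rebuilds the map, as the review suggests, so entries from
// a previous probe or cache load cannot linger when the info slice shrinks.
func (d *dev) setTritonInfo(info []TritonGPUInfo) {
	d.devices = make(map[int]GPUDevice, len(info))
	for _, ti := range info {
		d.devices[ti.ID] = GPUDevice{ID: ti.ID, TritonInfo: ti}
	}
}

func main() {
	d := &dev{}
	d.setTritonInfo([]TritonGPUInfo{{ID: 0}, {ID: 1}})
	d.setTritonInfo([]TritonGPUInfo{{ID: 0}}) // second load: GPU 1 gone
	fmt.Println(len(d.devices))               // 1
}
```

With the original merge-only logic, the second call would leave GPU 1 visible; rebuilding guarantees the map exactly reflects the latest info slice.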
